    Near-optimal replacement policies for shared caches in multicore processors

    An optimal replacement policy that minimizes the miss rate in a private cache was proposed several decades ago. It requires knowing the future access sequence the cache will receive. No equivalent exists for shared caches, because replacement decisions alter this future sequence. We present a novel near-optimal policy for minimizing the miss rate in a shared cache that approaches the optimal execution iteratively. During each iteration, the future access sequence is reconstructed on every miss by interleaving the future per-core sequences taken from the previous iteration. This single sequence feeds a classical private-cache optimal replacement policy. Our evaluation on a shared last-level cache shows that our proposal iteratively converges, within a margin of 0.1%, to a near-optimal miss rate that is independent of the initial conditions. The best state-of-the-art online policies achieve around 65% of the miss rate reduction obtained by our near-optimal proposal. In a shared cache, miss rate optimization does not imply the optimization of other metrics. Therefore, we also propose a new near-optimal policy to maximize fairness between cores. The best state-of-the-art online policy achieves 60% of the improvement in fairness seen with our near-optimal policy. Our proposals are useful both for setting upper performance bounds and for inspiring implementable mechanisms for shared caches. The authors acknowledge support from grants (1) PID2019-105660RB-C21 and PID2019-107255GB-C22 from Agencia Estatal de Investigación (AEI) from Spain and the European Regional Development Fund (ERDF); (2) gaZ: T58_20R research group from the Aragón Government and the European Social Fund (ESF); and (3) 2014-2020 "Construyendo Europa desde Aragón" from the European Regional Development Fund (ERDF).
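
    As a concrete illustration of the classical private-cache optimum that the reconstructed sequence feeds, the sketch below implements Belady's MIN on a single, already interleaved trace. Function and variable names are illustrative, and the iterative per-core interleaving described in the abstract is assumed to have produced the trace; it is not modeled here.

```python
# A minimal sketch of Belady's MIN for a fully associative private
# cache fed by one already-interleaved access trace (hypothetical
# names; the per-core interleaving step itself is omitted).

def belady_min_misses(trace, capacity):
    """Count misses under MIN: on a miss with a full cache, evict the
    resident line whose next use lies furthest in the future."""
    # Precompute, for each position, where the same line is used next.
    next_use = [float("inf")] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    cache = {}  # line address -> position of its next use
    misses = 0
    for i, line in enumerate(trace):
        if line in cache:
            cache[line] = next_use[i]  # hit: refresh next-use info
            continue
        misses += 1
        if len(cache) >= capacity:
            victim = max(cache, key=cache.get)  # furthest next use
            del cache[victim]
        cache[line] = next_use[i]
    return misses

# Example: a merged trace built from two per-core future sequences.
print(belady_min_misses(["A", "B", "C", "A", "D", "B", "A", "C"], 2))
```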

    STT-RAM memory hierarchy designs aimed to performance, reliability and energy consumption

    Current applications demand larger on-chip memory capacity, since off-chip memory accesses become a bottleneck. However, achieving this by scaling down the transistor size of SRAM-based Last-Level Caches (LLCs) may become prohibitive in terms of cost, area, and energy. Therefore, other technologies such as STT-RAM are becoming real alternatives for building the LLC in multicore systems. Although STT-RAM bitcells feature high density and low static power, they suffer from other trade-offs. On the one hand, STT-RAM writes are more expensive than STT-RAM reads and SRAM writes. To address this asymmetry, we will propose microarchitectural techniques to minimize the number of write operations on STT-RAM cells. On the other hand, reliability also plays an important role. STT-RAM cells suffer from three types of errors: write, read-disturbance, and retention errors. To handle these, we will propose techniques that manage redundant information, enabling error detection and recovery.
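
    The abstract does not commit to a specific write-minimization mechanism, but a common building block in this space is the differential (read-before-write) update, which toggles only the bitcells that actually change. The sketch below is a purely illustrative assumption of that idea, not the thesis's proposed technique.

```python
# Illustrative-only sketch of a differential (read-before-write)
# update: compare the old and new words and toggle only the bits
# that differ, so unchanged STT-RAM cells are not written at all.

def differential_write(old_word: int, new_word: int, width: int = 64):
    """Return the bit positions that must be toggled to turn
    old_word into new_word; all other bitcells stay untouched."""
    diff = old_word ^ new_word
    return [bit for bit in range(width) if (diff >> bit) & 1]

old, new = 0b10110010, 0b10010110
toggled = differential_write(old, new, width=8)
print(f"bitcells written: {len(toggled)} of 8 -> positions {toggled}")
```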

    Compression-aware and performance-efficient insertion policies for long-lasting hybrid LLCs

    Emerging non-volatile memory (NVM) technologies can potentially replace large SRAM memories such as the last-level cache (LLC). However, despite recent advances, NVMs suffer from higher write latency and limited write endurance. Recently, NVM-SRAM hybrid LLCs have been proposed to combine the best of both worlds. Several policies have been proposed to improve the performance and lifetime of hybrid LLCs by intelligently steering incoming LLC blocks into either the SRAM or the NVM part, based on the cache behavior of the blocks and the SRAM/NVM device properties. However, these policies neither consider compressing the contents of the cache block nor using partially worn-out NVM cache blocks. This paper proposes new insertion policies for byte-level fault-tolerant hybrid LLCs that collaboratively optimize for lifetime and performance. Specifically, we leverage data compression to utilize partially defective NVM cache entries, thereby improving the LLC hit rate. The key to our approach is to guide the insertion policy by both the reuse properties of the block and the size resulting from its compression. A block is inserted in NVM only if it is a read-reuse block or its compressed size is lower than a threshold; it is inserted in SRAM if it is a write-reuse block or its compressed size is greater than the threshold. We use set-dueling to tune the compression threshold at runtime. This compression threshold provides a knob to control the NVM write rate and, together with a rule-based mechanism, allows balancing performance and lifetime. Overall, our evaluation shows that, with affordable hardware overheads, the proposed schemes can nearly reach the performance of an SRAM cache with the same associativity while improving lifetime by 17× compared to a hybrid NVM-unaware LLC. Our proposed scheme outperforms the state-of-the-art insertion policies by 9% while achieving a comparable lifetime. The rule-based mechanism shows that by sacrificing, for instance, 1.1% or 1.9% of performance, the NVM lifetime can be further increased by 28% or 44%, respectively. This work was partially funded by the HiPEAC collaboration grant 2020, the Center for Advancing Electronics Dresden (cfaed), the German Research Council (DFG) through the HetCIM project (502388442) under the Priority Program on ‘Disruptive Memory Technologies’ (SPP 2377), and by grants (1) PID2019-105660RB-C21 and PID2019-107255GB-C22/AEI/10.13039/501100011033 from Agencia Estatal de Investigación (AEI), and (2) gaZ: T58_20R research group from the Dept. of Science, University and Knowledge Society, Government of Aragon.
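
    The steering rule spelled out above can be sketched directly. In the hypothetical code below, the Block fields, the steer() interface, and the tie-break toward SRAM when a block is both read- and write-reuse are assumptions; the runtime set-dueling that tunes the threshold is not modeled.

```python
# Hypothetical sketch of the compression-aware insertion rule:
# write-reuse or poorly compressing blocks go to SRAM; read-reuse
# or well-compressing blocks go to NVM.

from dataclasses import dataclass

@dataclass
class Block:
    read_reuse: bool       # predicted to be re-read after insertion
    write_reuse: bool      # predicted to be re-written after insertion
    compressed_size: int   # bytes after compression

def steer(block: Block, threshold: int) -> str:
    """Return the LLC part ('SRAM' or 'NVM') for an incoming block."""
    if block.write_reuse or block.compressed_size > threshold:
        return "SRAM"      # shield NVM from writes and large blocks
    if block.read_reuse or block.compressed_size < threshold:
        return "NVM"       # cheap to hold, tolerates partial faults
    return "SRAM"          # default when neither rule fires

print(steer(Block(read_reuse=True, write_reuse=False,
                  compressed_size=24), threshold=32))   # -> NVM
```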

    Leveraging data compression for performance-efficient and long-lasting NVM-based last-level cache

    Non-volatile memory (NVM) technologies are interesting alternatives for building on-chip Last-Level Caches (LLCs). Their advantages compared to SRAM memory are higher density and lower static power, but each write operation slightly wears out the bitcell, to the point of losing its storage capacity. In this context, this paper summarizes three contributions to the state of the art in NVM-based LLCs. Data compression reduces the size of the blocks and, together with wear-leveling mechanisms, can defer the wear-out of NVMs. Moreover, as capacity is reduced by write wear, data compression enables degraded cache frames to allocate blocks whose compressed size is adequate. Our first contribution is a microarchitecture design that leverages data compression and intra-frame wear-leveling to gracefully deal with NVM-LLC capacity degradation. The second contribution builds on this microarchitecture design to propose new insertion policies for hybrid LLCs, using set-dueling and taking into account the compression capabilities of the blocks. From a methodological point of view, although different approaches are used in the literature to analyze the degradation of an NV-LLC, none of them allows studying its temporal evolution in detail. In this sense, the third contribution is a forecasting procedure that combines detailed simulation and prediction, enabling an accurate analysis of the effect of different cache content mechanisms (replacement, wear leveling, compression, etc.) on the temporal evolution of the performance of multiprocessor systems employing such NVM-LLCs. Using this forecasting procedure, we show that the proposed NVM-LLC organizations and the insertion policies for hybrid LLCs significantly outperform the state of the art in both performance and lifetime metrics.
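
    To make the "degraded frames can still allocate compressed blocks" idea concrete, here is a minimal sketch, assuming a per-frame byte-level fault map; the names and the flat layout are illustrative only.

```python
# A degraded frame can still hold a block whose compressed size fits
# within the frame's surviving bytes (illustrative names).

def effective_capacity(fault_map):
    """Count still-functional bytes; fault_map[i] is True when byte
    i of the frame has worn out."""
    return sum(1 for faulty in fault_map if not faulty)

def fits(compressed_size, fault_map):
    return compressed_size <= effective_capacity(fault_map)

frame = [False] * 64          # a 64-byte frame...
frame[10] = frame[37] = True  # ...with two worn-out bytes
print(fits(48, frame), fits(63, frame))  # -> True False
```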

    Berti: An Accurate Local-Delta Data Prefetcher

    Data prefetching is a technique that plays a crucial role in modern high-performance processors by hiding long-latency memory accesses. Several state-of-the-art hardware prefetchers exploit the concept of deltas, defined as the difference between the cache line addresses of two demand accesses. Existing delta prefetchers, such as best-offset prefetching (BOP) and multi-lookahead prefetching (MLOP), train and predict future accesses based on global deltas. We observed that the use of global deltas results in missed opportunities to anticipate memory accesses. In this paper, we propose Berti, a first-level data cache prefetcher that selects the best local deltas, i.e., those that consider only demand accesses issued by the same instruction. Thanks to a high-confidence mechanism that precisely detects timely local deltas with high coverage, Berti generates accurate prefetch requests. It then orchestrates the prefetch requests to the memory hierarchy using the selected deltas. Our empirical results using ChampSim with SPEC CPU2017 and GAP workloads show that, with a storage overhead of just 2.55 KB, Berti improves performance by 8.5% compared to a baseline IP-stride prefetcher and by 3.5% compared to IPCP, a state-of-the-art prefetcher. Our evaluation also shows that Berti reduces dynamic energy at the memory hierarchy by 33.6% compared to IPCP, thanks to its high prefetch accuracy.
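
    The local-delta idea can be illustrated with a toy model that tracks deltas per load instruction (per IP) instead of globally. This sketch is not Berti's actual tables, timeliness detection, or confidence mechanism; all names and thresholds are assumptions.

```python
# Toy illustration of local deltas: train per load instruction (IP)
# rather than on a global delta stream.

from collections import Counter, defaultdict

class LocalDeltaPrefetcher:
    def __init__(self):
        self.last_line = {}                 # ip -> last cache line
        self.deltas = defaultdict(Counter)  # ip -> local delta counts

    def access(self, ip, line_addr):
        """Train on a demand access; return a line to prefetch or None."""
        prefetch = None
        if ip in self.last_line:
            self.deltas[ip][line_addr - self.last_line[ip]] += 1
            best, count = self.deltas[ip].most_common(1)[0]
            if best != 0 and count >= 2:    # crude confidence check
                prefetch = line_addr + best
        self.last_line[ip] = line_addr
        return prefetch

pf = LocalDeltaPrefetcher()
for line in (100, 102, 104, 106):           # one IP, stride of 2 lines
    candidate = pf.access(ip=0x400100, line_addr=line)
print(candidate)                            # -> 108
```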

    Pronóstico de capacidad efectiva y prestaciones en una cache no volátil de último nivel

    Write-induced degradation of the bitcells built with non-volatile memory (NVM) technologies is one of the main obstacles to implementing the last-level cache (LLC) with these technologies. Although the literature offers several proposals to cope with this degradation, the methodology used in previous work does not allow a detailed study of the evolution of the effective capacity or the performance of a non-volatile last-level cache (NV-LLC). Therefore, this work proposes a forecasting procedure that combines simulation and prediction in order to study that evolution. Compression, in turn, is one of the techniques proposed in the literature to deal with the degradation of non-volatile memories. First, compression reduces the amount of information written to an NV-LLC. Second, when frames lose capacity due to degradation, compression keeps them functional by allowing them to hold blocks of reduced size. The forecasting procedure developed in this work makes it possible to evaluate in detail the impact of different content-management techniques and mechanisms on the life expectancy and performance of an NV-LLC. The compression mechanism adopted in this work multiplies the life expectancy of an NV-LLC by up to 5. This work was funded by MINECO/AEI/FEDER (EU) (projects PID2019-105660RB-C21 and PID2019-107255GB-C22 / AEI / 10.13039/501100011033), the Government of Aragon (T58_20R research group), and ERDF 2014-2020 "Construyendo Europa desde Aragón".
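
    A minimal sketch of the simulate-then-predict idea: measure per-frame write rates over a short simulated window, then extrapolate when each frame exhausts an assumed write endurance. Every name and parameter below is illustrative.

```python
# Extrapolate effective capacity over time from simulated per-frame
# write rates and an assumed write endurance per frame.

def forecast_capacity(write_rates, endurance, horizon, step):
    """write_rates: writes/second per frame measured by simulation.
    Yield (time, fraction of frames still functional) up to horizon."""
    t = 0.0
    while t <= horizon:
        alive = sum(1 for rate in write_rates if rate * t < endurance)
        yield t, alive / len(write_rates)
        t += step

rates = [50.0, 80.0, 120.0, 200.0]   # four frames, from the window
for t, frac in forecast_capacity(rates, endurance=1e8,
                                 horizon=2e6, step=5e5):
    print(f"t = {t:9.0f} s  effective capacity = {frac:.2f}")
```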

    HyCSim: A rapid design space exploration tool for emerging hybrid last-level caches

    Recent years have seen a rising trend in the exploration of non-volatile memory (NVM) technologies in the memory subsystem. Particularly in the cache hierarchy, hybrid last-level cache (LLC) solutions have been proposed to meet the wide-ranging performance and energy requirements of modern applications. These emerging hybrid solutions need simulation and detailed exploration to fully understand their capabilities before exploiting them. Existing simulation tools are either too slow or incapable of prototyping such systems and optimizing for NVM devices. To this end, we propose HyCSim, a trace-driven simulation infrastructure that enables rapid comparison of various hybrid LLC configurations for different optimization objectives. Notably, HyCSim makes it possible to quickly estimate the impact of various hybrid LLC insertion and replacement policies, and of disabling a cache region at byte or cache-frame granularity for different fault maps. In addition, HyCSim allows evaluating the impact of various compression schemes on the overall performance (hit and miss rate) and on the number of writes to the LLC. Our evaluation on ten multi-program workloads from the SPEC 2006 benchmark suite shows that HyCSim accelerates simulation by 24× compared to the cycle-accurate gem5 simulator, with high fidelity. This work was partially funded by the HiPEAC collaboration grant 2020, the German Research Council (DFG) through the TraceSymm project (366764507) and the Co4RTM project (450944241), MCIN/AEI/10.13039/501100011033 (grants PID2019-105660RB-C21 and PID2019-107255GB-C22), and by the Aragón Government (T58_20R research group).
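
    For intuition about what a trace-driven LLC model looks like, the sketch below replays a (kind, address) trace against a set-associative LRU cache and tallies hits, misses, and line writes. It is a generic illustration, not HyCSim's actual code or interfaces.

```python
# Generic trace-driven LLC model: set-associative, LRU replacement.

from collections import OrderedDict

def simulate_llc(trace, sets=1024, ways=16, line=64):
    """trace: iterable of (kind, address), kind in {'R', 'W'}."""
    cache = [OrderedDict() for _ in range(sets)]
    hits = misses = writes = 0
    for kind, addr in trace:
        tag, idx = divmod(addr // line, sets)
        s = cache[idx]
        if tag in s:
            hits += 1
            s.move_to_end(tag)         # refresh LRU position
        else:
            misses += 1
            if len(s) >= ways:
                s.popitem(last=False)  # evict the LRU way
            s[tag] = True
            writes += 1                # the fill writes the line
        if kind == 'W':
            writes += 1                # a demand write also writes
    return hits, misses, writes

print(simulate_llc([('R', 0x1000), ('W', 0x1000), ('R', 0x80000)]))
```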